What is the main difference between classification and regression? Can you provide some examples?

Example

Suppose we want to classify wines according to their quality, simply distinguishing good wines from bad ones. We will first take a look at the white wines. Open the white wine data set in your environment and create a vector whose length equals the number of observations in the data set, equal to one if the variable “quality” is greater than or equal to 6 and zero otherwise. Append this vector as a column to the data and remove the variable quality.

What is the intuition behind this?

Solution

wine.white <- read.table("./data/wine.white.txt")

y <- rep(0, dim(wine.white)[1])
y[as.numeric(wine.white$quality) >= 6] <- 1
data <- cbind(wine.white[,-12], y)

Suppose now we want to predict the quality of the wine using the amount of volatile acidity in the wine. How could we do it?

Let’s try with OLS…

plot(y ~ volatile.acidity, data=data, col="red", main ="OLS" )
abline(lm(y ~ volatile.acidity, data=data))

Do you notice any problem?

Let’s suppose that \(Y \in \{0, 1\}\). Then

\[ E[Y] = P(Y=1)*1 + (1-P(Y=1))*0 \rightarrow E[Y|X] = P(Y=1|X) \]

Where \(P(Y=1|X)\) is bounded between 0 and 1.

plot(y ~ volatile.acidity, data=data, col="red", main ="Logistic" )
logistic <- glm(y ~ volatile.acidity,  data=data, family="binomial")
lines(predict(logistic, type="response")[order(volatile.acidity)]~sort(volatile.acidity), data=data, lwd=2, col='blue')

Theoretical framework

We want to find the classifier that minimizes a certain loss function \(L(\hat{h}(X), Y)\).

In general the optimal classifier must satisfy the following inequality:

\[ E[L(\hat{h}(X), Y) | X] \leq E[L(g(X), Y) | X] \quad \forall g \neq \hat{h} \]

By the law of iterated expectations this can be written as:

\(E_{Y,X}[L(\hat{h}(X), Y)] = E_{X}[E_{Y|X}[L(\hat{h}(X), Y)]]\)

Considering K different classes \(C_i\) for Y, then

\[ = E_{X}\left[\sum_{i=1}^K L(C_i, \hat{h}(X)) P(Y= C_i | X)\right] \]

And then we want to find:

\[ \underset{\hat{h}(X)}{argmin} \sum_{i=1}^K L(C_i, \hat{h}(X)) P(Y=C_i | X) \]

If \(L(C_i, \hat{h}(X)) = 1\) \(\forall C_i \neq \hat{h}(X)\) and \(L(C_i, \hat{h}(X)) = 0\) otherwise (the so-called 0/1 loss), then

\[ \sum_{C_i \neq \hat{h}(X)} P(Y=C_i | X) = 1 - P(Y= \hat{h}(X) | X) \]

The best predictor therefore maximizes \(P(Y= \hat{h}(X) | X)\), the so-called posterior probability of the predicted class.

In general

\[ \hat{h}(X) = \underset{C \in \{c_1, ..., c_K\}}{argmax} \; P(Y= C | X) \]

Question:

Let’s suppose that you are given only Y and no covariates. You have 10 observations and 3 are of class 1, 2 are of class 2 and the remaining are of class 3. What is your optimal classifier?
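This can be checked numerically. A minimal sketch, where the vector below simply encodes the class counts given in the question: with no covariates the posterior equals the prior, so under the 0/1 loss the optimal classifier always predicts the most frequent class.

```r
## 10 observations: 3 of class 1, 2 of class 2, 5 of class 3
y <- c(rep(1, 3), rep(2, 2), rep(3, 5))
priors <- table(y) / length(y)            # estimated P(Y = c_i): 0.3, 0.2, 0.5
best <- names(priors)[which.max(priors)]  # argmax of the posterior
best                                      # "3": predict class 3 for everyone
```

Predicting class 3 for every observation gives the lowest achievable error rate, here 0.5.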


Logistic Regression

With logistic regression:

\[ f_i(X) = P(Y = c_i | X) = \frac{e^{X\beta_i }}{1 + \sum_{j=1}^{K-1} e^{X\beta_j }} \]

For Example:

\[ log \frac{P(Y = c_1 | X)}{P(Y = c_K | X)} = \beta_{01} + X \beta_1 \]

How would you estimate the coefficients?

Maximum Likelihood Estimation

What is the joint probability of all observations for a certain density function \(f\) under the assumption of i.i.d. observations?


\[ f(x_1, x_2, \dots, x_n | \theta) = \prod_{i=1}^{n} f(x_i | \theta) = \Lambda(\theta; x_1, \dots, x_n) \]

To simplify computations we take the logarithm of the function (could you explain why?)

\[ loglik(\theta; x_1, \dots, x_n) = \sum_{i=1}^{n} ln f(x_i | \theta) \]
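One reason is numerical: the joint density is a product of many numbers below one, which underflows to zero in floating point, while the sum of logs stays finite (and sums are also easier to differentiate). A quick illustration with simulated standard normal data:

```r
set.seed(1)
x <- rnorm(10000)                     # 10,000 i.i.d. N(0,1) draws
lik    <- prod(dnorm(x))              # joint density: underflows to exactly 0
loglik <- sum(dnorm(x, log = TRUE))   # log-likelihood: a finite (negative) number
c(lik, loglik)
```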

And finally:

\[ \hat{\theta} = \underset{\theta}{argmax} \; loglik(\theta; x_1, \dots, x_n) \]
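In practice the argmax is usually found numerically. A minimal sketch with `optim` on a normal sample, where the closed-form MLE (the sample mean, and the variance with denominator \(n\)) lets us check the optimizer; parametrizing \(\sigma\) through its log is just a convenience to keep it positive:

```r
set.seed(42)
x <- rnorm(500, mean = 2, sd = 3)

## negative log-likelihood of N(mu, sigma), with sigma = exp(par[2]) > 0
negloglik <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
fit <- optim(c(0, 0), negloglik)

mu.hat <- fit$par[1]
sd.hat <- exp(fit$par[2])
c(mu.hat, mean(x))                      # numerical vs closed-form MLE of mu
c(sd.hat, sqrt(mean((x - mean(x))^2)))  # numerical vs closed-form MLE of sigma
```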

Note that the MLE coincides with the Bayesian (maximum a posteriori) estimator under a uniform prior distribution on the parameters. By Bayes’ theorem:

\[ P(\theta | x_1, \dots, x_n) = \frac{f(x_1, \dots, x_n | \theta) P(\theta)}{P(x_1, \dots, x_n)} \]

The denominator does not depend on \(\theta\), so it is irrelevant for the argmax; under a uniform \(P(\theta)\), maximizing the posterior therefore gives the same result as maximizing the likelihood.

Fitting logistic regression

In a two-class scenario the log-likelihood function is:

\[ l(\beta) = \sum_{i=1}^{n} [y_i log(p(x_i; \beta)) + (1 - y_i) log(1 - p(x_i; \beta))] \]

Which is equal to

\[ \sum_{i=1}^{n} [y_i x_i \beta - log(1 + e^{x_i \beta})] \]

Note that \(\beta\) = (\(\beta_0\), \(\beta_1\)) and \(x_i = (1, X_i)\).
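To see where the second expression comes from, substitute \(p(x_i; \beta) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}\), so that

\[ log(p(x_i; \beta)) = x_i \beta - log(1 + e^{x_i \beta}), \qquad log(1 - p(x_i; \beta)) = - log(1 + e^{x_i \beta}) \]

and each summand becomes \(y_i x_i \beta - y_i log(1 + e^{x_i \beta}) - (1 - y_i) log(1 + e^{x_i \beta}) = y_i x_i \beta - log(1 + e^{x_i \beta})\).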

The idea is to maximize the likelihood function.

\[ \frac{dl(\beta)}{d\beta} = \sum_{i=1}^{n} x_i (y_i - p(x_i; \beta)) = 0 \]

The system has p+1 equations in \(\beta\). Are they linear in the coefficients?


Tricky question:

Is this a sufficient condition to find an absolute optimum in this specific circumstance?


Unfortunately no analytical solution to this problem exists. Consequently we will solve it using a so-called gradient descent algorithm.

The idea is to minimize a certain cost function. For the sake of simplicity we analyze the two-class scenario, but you are expected to be able to work in a multi-class scenario.

\[ J(\theta) = \sum_{i=1}^{n} L(h_\theta(X_i), y_i) \]

with \(L(h_\theta(X_i), y_i) = -log(h_\theta(X_i))\) if \(y_i = 1\) and \(L(h_\theta(X_i), y_i) = -log(1 - h_\theta(X_i))\) if \(y_i = 0\).

Because \(y_i\) is either one or zero, this can be written in a compact form:

\[ J(\theta) = \frac{1}{N}\sum_{i=1}^{n} [-log(h_\theta(X_i))\, y_i - log(1 - h_\theta(X_i))\,(1 - y_i)] \]

How do we find the minimum?

Gradient Descent

Imagine you are in a canyon and you want to go to the bottom. What is the first thing you would ask before moving?


Algorithm:

By defining \(\gamma\) as a certain learning rate (the size of the step down the hill):

\[ \theta_{n+1} := \theta_{n} - \gamma \nabla_{\theta} J(\theta_{n}) \]

The right step size
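The choice of \(\gamma\) matters: too small and convergence is slow, too large and the iterates overshoot the minimum and diverge. A minimal sketch on \(f(\theta) = \theta^2\), whose gradient is \(2\theta\):

```r
## Run n gradient steps on f(theta) = theta^2 starting from theta0
grad_step <- function(theta0, gamma, n = 50) {
  theta <- theta0
  for (i in 1:n) theta <- theta - gamma * 2 * theta  # theta - gamma * f'(theta)
  theta
}
grad_step(10, gamma = 0.1)  # shrinks towards the minimum at 0
grad_step(10, gamma = 1.1)  # each step multiplies the error by -1.2: divergence
```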

Example

## Create Data
set.seed(1)
x <- matrix(rnorm(1000*50), 1000, 50)
theta <- array(1, c(50,1))
y <- 1 + x%*%theta + rnorm(1000)
X <- cbind(1, x)

y <- ifelse(y>0, 1, 0)


gradd_logit <- function(y, X, theta0, epsilon = 1e-07, a = .1, 
                        lambda = 0) {
  ## X is n x k matrix
  ## y is n x 1 vector
  ## epsilon: stopping criterion --> quit when the update is smaller than epsilon
  ## theta0: initial value
  ## lambda: optional L2 (ridge) penalty
  Delta <- 1000000
  theta_old <- theta0
  n <- NROW(X)
  iter <- 0
  while (Delta > epsilon) {
    Xt <- X%*%theta_old
    ht <- c(1/(1+exp(Xt)))
    ## DJ is the gradient -mean(x_i * (y_i - p_i)), written in an
    ## algebraically equivalent form
    DJ <- -colMeans(ht*X*c(exp(Xt)*(y-1)+y))
    theta <- theta_old - a*DJ - a*lambda*2*theta_old
    Delta <- max((theta-theta_old)^2)
    # cat("Iteration:", iter, "Delta:", Delta, "\n")
    theta_old <- theta
    iter <- iter + 1
  }
  theta
}

gradd_logit(y,X, rep(0, length=51))
##  [1] 0.6961409 0.6624840 0.6513182 0.6205348 0.7732304 0.6859719 0.6717158
##  [8] 0.7823584 0.8242944 0.5875282 0.7687820 0.6830162 0.6748320 0.6573575
## [15] 0.5545125 0.6845484 0.6751578 0.7158370 0.6999495 0.6891910 0.6396030
## [22] 0.7820009 0.7428510 0.7097338 0.6791514 0.7120639 0.8994722 0.5573076
## [29] 0.6749660 0.7539325 0.6058658 0.8019259 0.6894339 0.5780761 0.7512045
## [36] 0.6801436 0.8191608 0.8356648 0.7342417 0.8068681 0.6912554 0.6792209
## [43] 0.6108258 0.7577239 0.6536681 0.7806451 0.8238721 0.6235609 0.6930055
## [50] 0.7707701 0.6105603


There are many other possible solutions (you may want to use gradient ascent, maximizing the log-likelihood function, for example).

Exercise: OLS

We will try to estimate the coefficients for an OLS regression using the same methodology.

First, answer the following questions:

Then:

Optional Challenge

Solution

  1. The objective function is the MSE:

\[ \hat{\beta} = \underset{\beta}{argmin} \sum_{i=1}^n (y_i - X_i \beta)^2 \]

  2. I would use gradient descent with

\[ \bigtriangledown_{\beta} = - 2 \sum_{i=1}^n X_i (y_i - X_i \beta) \]

atac <- read.table("./data/atac.txt")
library(pracma)
## Warning: package 'pracma' was built under R version 3.2.5
## Linear Projection
ptm <- proc.time()
X <- cbind(1, atac$tempo_attesa_secondi.y)
y <- atac$traveltime
beta <- mldivide(t(X)%*%X, t(X)%*%y)  ## <<<---- Could you tell why mldivide and not simple division?
beta
##              [,1]
## [1,] 0.0007391378
## [2,] 1.0875441230
## Gradient Descent

alpha <- exp(-20)
iter <- 1

theta <- matrix(c(0,0), nrow=2)
while (iter < 1000) {
      error <- (y - X %*% theta)
      delta <-  - t(X) %*% (error) / length(y)
      theta_old <- theta
      theta <- theta - alpha* delta
      iter <- iter + 1
    }
theta
##              [,1]
## [1,] 0.0007406195
## [2,] 1.0514639816
## Time of performance
    
    proc.time() - ptm
##    user  system elapsed 
##   0.523   0.076   0.641

New Loss:

A two-parameter loss function: \[ \Lambda_i(y,\hat{f} ) = [\alpha + (1 - 2 \alpha) I(y_i - \hat{f}(x_i) < 0)] | y_i - \hat{f}(x_i)|^{\rho} \]

With \(\alpha = 0.33\) and \(\rho = 2\)

The gradient has the following form:

\[ - 2 [ \alpha \sum_{i=1}^n X_i (y_i - X_i \beta)I(y_i < X_i \beta) + (1 - \alpha) \sum_{i=1}^n X_i (y_i - X_i \beta)I(y_i > X_i \beta)] \]

Why do you think we chose \(\rho = 2\)?

Solution:

x <- atac$tempo_attesa_secondi.y
learn <- 0.5/1000000
num_iters <- 1000
alpha <- 0.33

# keep history
cost_history <- double(num_iters)
theta_history <- list(num_iters)

# initialize coefficients
theta <- matrix(c(0,0), nrow=2)

cost <- function(X, y, theta) {
  error <- (X %*% theta - y)
  w <- rep(1, length(y))
  w[error >= 0] <- alpha
  w[error < 0] <- 1 - alpha
  sum( w*(X %*% theta - y)^2 ) / (2*length(y))
}
# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# gradient descent
for (i in 1:num_iters) {
  w <- rep(1, length(y))
  error <- (X %*% theta - y)
  w[error >= 0] <- alpha
  w[error < 0] <- 1 - alpha
  delta <- t(X) %*% (w*error) / length(y)
  theta <- theta - learn * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

# plot data and converging fit
plot(x, y, col=rgb(0.2,0.4,0.6,0.4), main='Gradient Descent with Asymmetric Loss',
     ylab="traveltime", xlab="tempo attesa secondi")
for (i in 1:num_iters) {
  abline(coef=theta_history[[i]], col="red")
}
abline(coef=theta, col="black", lwd=2)
legend("topleft", legend=c("Convergence","Iterations"),
       cex=1, lwd=c(2.5,2.5), col=c("black","red"))

    # check out the trajectory of the cost function
    cost_history[seq(1,num_iters, by=100)]
##  [1] 152187.65  16283.39  16283.38  16283.38  16283.37  16283.37  16283.36
##  [8]  16283.36  16283.35  16283.35
    plot(cost_history, type='l', col='blue', lwd=2, main='Cost function', ylab='cost', xlab='Iterations')

And you may want to play with this App…

Stochastic Optimization

We won’t go into detail on this topic, but it is worth mentioning a few points.

You may run into trouble with gradient descent (or ascent), especially when you have tons of observations. In fact gradient descent requires computing the gradient over all the observations before taking each step. But what if this is a long-lasting computation?

To save time you might decide to update your parameters in the following way:

\[ \theta := \theta_{old} - \gamma \nabla_{\theta, i} \]

Where \(\nabla_{\theta, i}\) is the gradient at the i-th observation with respect to the parameters \(\theta\).


## Stochastic Gradient with Atac : Asymmetric Loss
ptm <- proc.time()

learn <- 0.5/1000000
num_iters <- 1000
alpha <- 0.33

# initialize coefficients
theta <- matrix(c(0,0), nrow=2)

# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))

# stochastic gradient descent: one randomly drawn observation per step
for (i in 1:num_iters) {
  ii <- sample(1:nrow(X), size=1, replace=FALSE) ## <<--  You may want to increase the size!
  w <- rep(1, length(y[ii]))
  error <- (X[ii,] %*% theta - y[ii])
  w[error >= 0] <- alpha
  w[error < 0] <- 1 - alpha
  delta <- X[ii,] %*% (w*error) / length(y[ii])
  theta <- theta - learn * delta
}
theta

## Time of performance
proc.time() - ptm

Classification

## Generate the data 

set.seed(123)
x1 <- runif(50, 1, 9)
x2 <- runif(50, 0, 10)
y <- rep(0, 50)
y[(x1 >= 6 & x1<=9 & x2 >= 6) | (x1 > 8 & x2 < 2)] <- 1
data <- as.data.frame(cbind(x1, x2 , y))
names(data) <- c("x1", "x2", "y")
library(nnet)
a.log <- multinom(y ~ x1 + x2 , data=data, maxit=5000)
## # weights:  4 (3 variable)
## initial  value 34.657359 
## iter  10 value 13.204275
## final  value 13.186306 
## converged
pr.log <- predict(a.log, newdata=data, type="class")
cer <- mean(data$y != pr.log)
cer
## [1] 0.16
## Repeat the process with test and training set

ii <- sample(1:length(y), size= floor(length(y)/3), replace=FALSE)
test <- data[ii,]
train <- data[-ii,]

a.log <- multinom(y ~ x1 + x2 , data=train, maxit=5000)
## # weights:  4 (3 variable)
## initial  value 23.567004 
## iter  10 value 8.637172
## iter  20 value 8.584837
## final  value 8.584834 
## converged
pr.log <- predict(a.log, newdata=test, type="class")

mean(test$y != pr.log) 
## [1] 0.1875

And looking at the probabilities…

Exercise

Open the red wine data set and add the variable y, equal to one if quality is greater than or equal to 6, zero otherwise.

First you are required to use the following covariates: chlorides and volatile acidity.

Hint: use a 0/1 loss function

You are required to use the package nnet.

Then:

Optional:

red <- read.table("./data/wine-red.txt")
library(nnet)
y <- rep(0, dim(red)[1] )
y[red$quality>=6] <- 1
a.log <- multinom(y ~ volatile.acidity + chlorides , data=red, maxit=5000)
## # weights:  4 (3 variable)
## initial  value 1108.342342 
## final  value 1009.381570 
## converged
pr.log <- predict(a.log, newdata=red, type="class")
cer <- mean(y != pr.log)
cer
## [1] 0.3539712
## Graphically

xv <- seq(0, 1, length=300)
xr <- seq(0, 1, length=300)
xx <- expand.grid(xv, xr)
names(xx) <- c("volatile.acidity", "chlorides")
a.log <- glm(y ~ volatile.acidity + chlorides , data=red, family="binomial")
pr.log <- predict(a.log, newdata=xx, type="response")
image(xv, xr, matrix(pr.log, 300, 300), col=terrain.colors(100),
      xlab="volatile.acidity", ylab="chlorides", main='Logistic')
points(chlorides ~ volatile.acidity, data=red, pch=19, cex=0.5, 
       col=c('blue','red')[y+1])

legend("topright", legend=c("Bad quality","Good Quality"),cex=1,        
  lwd=c(1,1),lty=c(NA,NA),     pch=c(19,19), col=c("blue","red"))

set.seed(123)
ii <- sample(1:length(y), size= floor(length(y)/3), replace=FALSE)
test <- red[ii,]
train <- red[-ii,]
y.test <- y[ii]
y.train <- y[-ii]
a.log <- multinom(y.train ~ volatile.acidity + chlorides , data=train, maxit=5000)
## # weights:  4 (3 variable)
## initial  value 738.894894 
## final  value 670.975983 
## converged
pr.log <- predict(a.log, newdata=test, type="class")
mean(y.test != pr.log) 
## [1] 0.3508443
## Cross Validation

set.seed(123)

  n <- length(y)
  k <- 10
  ii <- sample(rep(1:k, length= n))
  
  pr.log <- rep(NA, length(y))
  for (j in 1:k){
    hold <- (ii == j)
    train <- (ii != j)
    
    a.log <- glm(y[train] ~ volatile.acidity + chlorides , data=red[train,], family="binomial")
    pr.log[hold] <- predict(a.log, newdata=red[hold,], type="response")
  }

pr.log[pr.log>= 0.5] <- 1
pr.log[pr.log< 0.5] <- 0
mean(y != pr.log)
## [1] 0.3589744



## Second part of the exercise

y <- rep(0, dim(red)[1] )
y[red$quality>=8] <- 1
a.log <- multinom(y ~ volatile.acidity + chlorides , data=red, maxit=5000)
## # weights:  4 (3 variable)
## initial  value 1108.342342 
## iter  10 value 94.672727
## iter  20 value 92.836052
## iter  30 value 91.591819
## iter  40 value 91.435758
## final  value 91.370717 
## converged
pr.log <- predict(a.log, newdata=red, type="class")
cer <- mean(y != pr.log)
cer
## [1] 0.01125704

Error rate with Cross Validation:

## [1] 0.01125704
## Check the number of wrong good quality
## Do you notice something strange?
table(pr.log, y)
##       y
## pr.log    0    1
##      0 1581   18

Does this make sense?

Sensitivity VS Specificity

# Specificity: correctly classified negatives over all true negatives
sum( (y == 0) & (pr.log == 0) ) / sum( y == 0)
## [1] 1
# Sensitivity: correctly classified positives over all true positives
sum( (y == 1) & (pr.log == 1) ) / sum( y == 1)
## [1] 0

Optional:

red$quality <- as.factor(red$quality)
a.log <- multinom(quality~. , data=red, maxit=5000)
## # weights:  78 (60 variable)
## initial  value 2865.023391 
## iter  10 value 1963.486425
## iter  20 value 1795.520694
## iter  30 value 1541.529008
## iter  40 value 1479.891422
## iter  50 value 1471.286160
## iter  60 value 1469.045606
## iter  70 value 1466.890589
## iter  80 value 1466.111779
## iter  90 value 1465.874531
## iter 100 value 1465.552983
## iter 110 value 1465.371954
## iter 120 value 1465.163918
## iter 130 value 1465.064718
## iter 140 value 1465.052367
## iter 150 value 1465.033690
## iter 160 value 1464.908693
## iter 170 value 1464.694613
## iter 180 value 1464.511253
## iter 190 value 1464.266836
## iter 200 value 1464.066525
## iter 210 value 1463.769756
## iter 220 value 1462.994644
## iter 230 value 1461.889203
## iter 240 value 1460.824287
## final  value 1460.811130 
## converged
pr.log <- predict(a.log, newdata=red, type="class")
mean(red$quality != pr.log) 
## [1] 0.3933709
table(pr.log, y)
##       y
## pr.log   0   1
##      3   5   0
##      4   1   0
##      5 771   0
##      6 689  10
##      7 115   8
##      8   0   0

Sensitivity VS Specificity

Let’s go back to the previous example:

With a classification threshold at 0.5 of probability for each class the in-sample error is:

##    pred
## y    0  1
##   0 36  4
##   1  4  6

##    pred
## y    0  1
##   0 29 11
##   1  0 10

##    pred
## y    0
##   0 40
##   1 10

ROC

What would you like to maximize?
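A ROC curve makes the trade-off explicit: for every threshold we plot sensitivity against 1 − specificity. A minimal sketch with simulated scores (hypothetical data, not the wine fits above):

```r
set.seed(123)
y <- rbinom(200, 1, 0.3)    # true classes
score <- y + rnorm(200)     # a noisy score, higher on average for class 1
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(score[y == 1] >= t))  # sensitivity
fpr <- sapply(thresholds, function(t) mean(score[y == 0] >= t))  # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity", main = "ROC")
abline(0, 1, lty = 2)       # a random classifier lies on the diagonal
```

Lowering the threshold moves us along the curve: sensitivity rises, specificity falls.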

LDA and QDA

…Remember Bayes’ theorem…

\[ P(Y=c_k | X = x) = \frac{P(X=x | Y=c_k)P(Y=c_k)}{P(X=x)} \]

In a compact form:

\[ P(Y=c_k | X = x) = \frac{f_k(X)p_k}{f(X)} \]

So…

\[ \hat{h}(x) = \underset{k \in \{1, \dots, K\}}{argmax} \; f_k(X) p_k \]

Could you tell why?


We can make different assumptions on the distribution of the covariates.

Let’s assume that \(X|Y = c_k \sim N(\mu_k, \Sigma)\) for certain values of mu and sigma. Can we use this assumption to estimate the class of Y?


Given a certain training set we can estimate the parameters:

\[ \hat{f_k} \sim N(\hat{\mu_k}, \hat{\Sigma}) \]

Then by replacing \(f_k\) with \(\hat{f_k}\) the conditions become:

\(f_k p_k > f_j p_j \forall j \neq k\) for each class k. We can solve this problem by maximizing the log-ratio:

\[ log \frac{(f_k p_k)}{(f_j p_j)} = log\frac{(f_k)}{(f_j)} + log\frac{(p_k)}{(p_j)} \]

Then we write out the normal densities explicitly and expand the logarithm:

\[ = log\frac{(p_k)}{(p_j)} - \frac{1}{2} (\mu_k + \mu_j)'\Sigma^{-1}(\mu_k - \mu_j) + X'\Sigma^{-1}(\mu_k - \mu_j) \]

Note that the equation is linear in X. This is why LDA has linear boundaries (like logistic regression).

library(MASS)
LDA <- lda(y ~ . , data=data)
QDA <- qda(y ~ . , data=data)
pred <- predict(LDA, newdata=data)$posterior[,2]

With QDA instead we assume \[ \hat{f_k} \sim N(\hat{\mu_k}, \hat{\Sigma_k}) \]


Whereas LDA tends to have a lower variance on average, it might have a higher bias than QDA: could you explain why?

Note that if the covariance matrix is different for each class then the cancellation does not occur and the boundaries are quadratic:

The effect of outliers on the regressions:

Which one?


Remember that LDA and QDA represent the usual trade-off between flexibility and variance. The difference between the Gaussian and the binomial MLE lies in the underlying assumptions.

K-Nearest Neighbours


Question: How could we build local fits for classification?


\[ f(x) = \frac{1}{|N_k(x)|} \sum_{j \in N_k(x)} y_j \quad \text{where} \quad N_k(x) = \{i: ||x - x_i|| \leq d_{k,x}\} \]

Using majority vote, we make a local prediction by looking at the class most represented among the neighbours in that region of the space.
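The majority vote can be written down directly; a sketch of a single-point k-NN classifier (just the idea, not the `class::knn` implementation):

```r
## Classify one point x0 by majority vote among its k nearest neighbours
knn_one <- function(x0, X, y, k = 3) {
  d <- sqrt(rowSums((t(t(X) - x0))^2))  # Euclidean distance of each row of X to x0
  nn <- order(d)[1:k]                   # indices of the k nearest neighbours
  votes <- table(y[nn])                 # class counts among the neighbours
  names(votes)[which.max(votes)]        # the most represented class wins
}

## Toy data: class 0 around (0,0), class 1 around (5,5)
set.seed(1)
X <- rbind(matrix(rnorm(20), 10, 2), matrix(rnorm(20, mean = 5), 10, 2))
y <- rep(c(0, 1), each = 10)
knn_one(c(5, 5), X, y, k = 3)  # "1"
knn_one(c(0, 0), X, y, k = 3)  # "0"
```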


Example:


But with outliers:


More neighbours?

library(class)

u10 <- as.numeric(knn(train=data[,c(1, 2)], cl=data$y, test=xx, k=10)) ## <--- See in the help k=?

With Outliers:


Why?


With prob:

ufull <- attr(knn(train=data[,c(1, 2)], cl=data$y, test=xx, k=nrow(data)/2, prob=TRUE), "prob") ### <-- Check for "prob"

Issues:


Classification Group Competition

We have been asked to predict heart disease, and we can control for several variables such as tobacco consumption, adiposity, alcohol consumption, age, obesity, etc.

We have at our disposal real data from a study with 451 observations in total. You will be given 401 observations to train your model. The remaining 50 observations are your test set (no dependent variable is provided for the test set). The y is the last variable, chd.

What to do:

Game:

At the end of the competition the groups that got at least 3 points will present their work to the class.

We expect from each group to use cross validation to select features and models.

Solution

Note: this is not the solution with the best predictor, but only for the first point; we expect a much better performance from you.

## We will select a knn --> you should have tried more predictors and cases

## Note that in this case knn is not the best predictor (too few observations...)

library(class)
heart.test <- read.table("./data/heart_test.txt")
heart.train <- read.table("./data/heart_train.txt")



## Look at the CER with CV
heart.train <- na.omit(heart.train)

heart.train$chd <- as.factor(heart.train$chd)
y <- heart.train$chd
xx <- heart.train[,c(2,7)]

set.seed(123)

  n <- length(y)
  k <- 10
  ii <- sample(rep(1:k, length= n))
  
  pr.knn5 <- pr.knn10 <- pr.knn20 <- pr.knn35 <- pr.knn50 <- rep(NA, length(y))
  for (j in 1:k){
    
    xx.tr <- xx[ii != j, ]
    xx.te <- xx[ii == j, ]
    y.tr <- y[ii != j]
    
    ## Standardize using the mu and sd of the training set!
    mu.tr <- colMeans(xx.tr)
    sd.tr <- apply(xx.tr, 2, sd)
    xx.tr <- scale(xx.tr)
    xx.te <- scale(xx.te, center=mu.tr, scale=sd.tr)
    
    u5  <- knn(train=xx.tr, cl=y.tr, test=xx.te, k=5)    
    u10 <- knn(train=xx.tr, cl=y.tr, test=xx.te, k=10)
    u20 <- knn(train=xx.tr, cl=y.tr, test=xx.te, k=20)
    u35 <- knn(train=xx.tr, cl=y.tr, test=xx.te, k=35)
    u50 <- knn(train=xx.tr, cl=y.tr, test=xx.te, k=50)
    pr.knn5[ii == j] <- as.vector(u5)
    pr.knn10[ii == j] <- as.vector(u10)
    pr.knn20[ii == j] <- as.vector(u20)
    pr.knn35[ii == j] <- as.vector(u35)
    pr.knn50[ii == j] <- as.vector(u50)
  }


cbind(mean(y != pr.knn5),mean(y != pr.knn10),mean(y != pr.knn20),mean(y != pr.knn35), mean(y != pr.knn50))
##       [,1]   [,2]   [,3]   [,4]  [,5]
## [1,] 0.425 0.3775 0.3325 0.3375 0.335
table(y, pr.knn20)
##    pr.knn20
## y     0   1
##   0 226  29
##   1 104  41
## Prediction on test set

xx <- heart.train[,c(2,7)]
mu.tr <- colMeans(xx)
sd.tr <- apply(xx, 2, sd)
xx.tr <- scale(xx)
xx.te <- scale(heart.test[,c(2,7)], center=mu.tr, scale=sd.tr)
u20 <- knn(train=xx.tr, cl=y, test=xx.te, k=20)

## Real results: 28%